HKUST CSIT5401 Recognition System lecture notes 1. Recognition system review notes.
- Laplacian Point Detector
- Line Detector
- Edge Detectors
- Scale Invariant Feature Transform (SIFT)
- Harris Corner Detector
Laplacian Point Detector
There are three different types of intensity discontinuities in a digital image:
- Point (Isolated Point)
- Line
- Edge (Ideal, Ramp and Roof)
An intensity discontinuity is detected based on the mask response \(R\) within a pre-defined window, e.g. \(3\times3\): \[ R=\sum_{i=1}^9w_iz_i \] where \(w_i\) are the weights of the mask and \(z_i\) are the intensity values under the window.
If \(|R|\ge T\), then a point has been detected at the location on which the mask is centred, where \(T\) is a non-negative threshold.
The mask below is the Laplacian mask for detecting isolated points. The sum of all weights is zero, which guarantees zero response over flat (constant-intensity) regions.
1 | 1 | 1 |
---|---|---|
1 | -8 | 1 |
1 | 1 | 1 |
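As a minimal Matlab sketch (assuming the Image Processing Toolbox's imfilter; the file name 'points.png', the greyscale input and the threshold choice are placeholders), applying this mask and thresholding \(|R|\) might look like:

```matlab
f = im2double(imread('points.png'));   % greyscale input (placeholder file name)
w = [1 1 1; 1 -8 1; 1 1 1];            % Laplacian point-detection mask
R = imfilter(f, w, 'replicate');       % mask response at every pixel
T = 0.5 * max(abs(R(:)));              % example non-negative threshold
points = abs(R) >= T;                  % 1 where an isolated point is detected
```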
The Laplacian is given as (\(f\) is input image): \[ \triangledown^2f(x,y)=\frac{\partial^2f}{\partial x^2}+\frac{\partial^2 f}{\partial y^2} \] where \[ \frac{\partial^2f}{\partial x^2}=f(x+1,y)-2f(x,y)+f(x-1,y)\\ \frac{\partial^2 f}{\partial y^2}=f(x,y+1)-2f(x,y)+f(x,y-1) \] The discrete implementation of the Laplacian operator is given as: \[ \triangledown^2f(x,y)=f(x+1,y)+f(x-1,y)+f(x,y+1)+f(x,y-1)-4f(x,y) \] Below are several Laplacian masks:
Line Detector
A line is detected when more than one aligned, connected point is detected, or when the response of a line mask is greater than some threshold. Below are line masks for detecting lines (1 pixel thick) in 4 different specific directions:
If we want to detect a line in a specified direction, then we should use the mask associated with that direction and threshold its output responses.
If 4 line masks are used, then the final response is equal to the largest response among the masks: \[ R = \max\left(|R_{horizontal}|, |R_{45}|, |R_{vertical}|, |R_{-45}|\right) \] Example code (Matlab):
```matlab
f = imread('xxx.png'); % read image
```
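The snippet above only reads the image. A hedged completion of the example (assuming the Image Processing Toolbox's imfilter; the masks below are the standard 3x3 line masks and the threshold is an example choice) might be:

```matlab
% Completion sketch: take the largest absolute response over the
% four standard 3x3 line masks, then threshold.
f     = im2double(imread('xxx.png'));      % read image (placeholder name, as above)
wHorz = [-1 -1 -1;  2  2  2; -1 -1 -1];    % horizontal lines
wP45  = [-1 -1  2; -1  2 -1;  2 -1 -1];    % +45 degree lines
wVert = [-1  2 -1; -1  2 -1; -1  2 -1];    % vertical lines
wM45  = [ 2 -1 -1; -1  2 -1; -1 -1  2];    % -45 degree lines
R = max(max(abs(imfilter(f, wHorz)), abs(imfilter(f, wP45))), ...
        max(abs(imfilter(f, wVert)), abs(imfilter(f, wM45))));
T = 0.5 * max(R(:));                       % example threshold
lineMap = R >= T;                          % detected line pixels
```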
Edge Detectors
An edge is the boundary between regions, marked by a meaningful discontinuity in grey level. There are three types of edges: ideal edge (left), ramp edge (middle) and roof edge (right).
For an ideal edge (step edge, left), an edge is a collection of connected pixels on the region boundary. Ideal edges can occur over the distance of 1 pixel.
Roof edges (right) are models of lines through a region, with the base (width) of a roof edge being determined by the thickness and sharpness of the line. In the limit, when its base is 1 pixel wide, a roof edge becomes a 1 pixel thick line running through a region in an image. Roof edges can represent thin features, e.g., roads, line drawings, etc.
For a ramp edge (middle):
- edge point is any point contained in the ramp
- edge length is determined by the length of the ramp
- the slope of the ramp is inversely proportional to the degree of blurring in the edge
- the first derivative of the intensity profile is positive at the points of transition into and out of the ramp (we move from left to right)
- the second derivative of the intensity profile is positive at the transition associated with the dark side of the edge, and negative at the transition associated with the light side of the edge
The magnitude of the first derivative can be used to detect the presence of an edge. The sign of the second derivative can be used to determine whether an edge pixel lies on the dark or light side of an edge. The zero-crossing property of the second derivative is very useful for locating the centres of thick edges.
However, even a small amount of noise can have a significant impact on the first and second derivatives used for edge detection. Image smoothing is therefore commonly applied before edge detection so that the two derivatives can be estimated more accurately.
Gradient Operator
The computation of the gradient of an image is based on obtaining the partial derivatives \(G_x = \partial f/\partial x\) and \(G_y = \partial f/\partial y\) at every pixel location (x,y). The gradient direction and gradient magnitude are \[ \tan^{-1}\left(\frac{G_y}{G_x}\right)\\ |\triangledown f|\approx|G_x|+|G_y| \]
There are several ways to approximate the partial derivatives (a Matlab sketch using the Sobel masks follows the list):
- Roberts cross-gradient operators
- \(G_x = (z_9-z_5)\)
- \(G_y=(z_8-z_6)\)
- Prewitt operators
- \(G_x=(z_7+z_8+z_9)-(z_1+z_2+z_3)\)
- \(G_y=(z_3+z_6+z_9)-(z_1+z_4+z_7)\)
- Sobel operators
- \(G_x=(z_7+2z_8+z_9)-(z_1+2z_2+z_3)\)
- \(G_y=(z_3+2z_6+z_9)-(z_1+2z_4+z_7)\)
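Below is a small Matlab sketch of the Sobel case (assuming the Image Processing Toolbox's imfilter, with the labels \(z_1\ldots z_9\) running row by row from the top-left of the window; the file name is a placeholder):

```matlab
f  = im2double(imread('xxx.png'));      % placeholder file name
wx = [-1 -2 -1;  0  0  0;  1  2  1];    % Gx = (z7+2*z8+z9) - (z1+2*z2+z3)
wy = [-1  0  1; -2  0  2; -1  0  1];    % Gy = (z3+2*z6+z9) - (z1+2*z4+z7)
Gx = imfilter(f, wx, 'replicate');
Gy = imfilter(f, wy, 'replicate');
gradMag = abs(Gx) + abs(Gy);            % |grad f| approximated by |Gx| + |Gy|
gradDir = atan2(Gy, Gx);                % gradient direction in radians
```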
Below are diagonal edge masks for detecting discontinuities in the diagonal directions.
Marr-Hildreth Edge Detector
The Laplacian of an image f(x,y) at location (x,y) is defined as \[ \triangledown^2f(x,y)=\frac{\partial^2f}{\partial x^2}+\frac{\partial^2 f}{\partial y^2} \] There are two approximations: \[ \triangledown^2f=4z_5-(z_2+z_4+z_6+z_8)\\ \triangledown^2f=8z_5-(z_1+z_2+z_3+z_4+z_6+z_7+z_8+z_9) \] The Laplacian generally is not used in its original form for edge detection (based on zero-crossing property) because it is unacceptably sensitive to noise. We smooth the image by using a Gaussian blurring function \(G(r)\), where \(r\) = radius, before we apply the Laplacian operator: \[ G(r)=\exp(\frac{-r^2}{2\sigma^2}) \] The Laplacian of a Gaussian (LoG) operator is defined by \[ \text{LoG}(f)=\triangledown^2(f\ast G)=f\ast(\triangledown^2G) \] Using the LoG, the location of edges can be detected reliably based on the zero-crossing values because image noise level is reduced by the Gaussian function.
Marr-Hildreth algorithm
1. Filter the input with an n-by-n Gaussian blurring filter \(G(r)\).
2. Compute the Laplacian of the image resulting from Step 1 using one of the following 3-by-3 masks.
3. Find the zero crossings of the image from Step 2.
- A zero crossing at a pixel implies that the signs of at least two of its opposing neighboring pixels must be different.
- There are four cases to test: left/right, up/down, and the two diagonals.
- For testing, the signs of the two opposing neighboring pixels must be different and their absolute Laplacian values must be larger than or equal to some threshold.
- If yes, we call the current pixel a zero-crossing pixel.
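A hedged Matlab sketch of these steps (fspecial and imfilter are from the Image Processing Toolbox; the filter size, sigma and threshold are example values, and only the left/right and up/down cases are tested for brevity):

```matlab
f   = im2double(imread('xxx.png'));                           % placeholder file name
g   = imfilter(f, fspecial('gaussian', 25, 4), 'replicate');  % Step 1: Gaussian smoothing
lap = imfilter(g, [1 1 1; 1 -8 1; 1 1 1], 'replicate');       % Step 2: Laplacian
T   = 0.04 * max(abs(lap(:)));                                % example threshold
[rows, cols] = size(lap);
edgeMap = false(rows, cols);
for i = 2:rows-1
    for j = 2:cols-1
        % Step 3: opposing neighbours with different signs whose absolute
        % Laplacian values are at or above the threshold (left/right, up/down)
        lr = sign(lap(i, j-1)) ~= sign(lap(i, j+1)) && ...
             abs(lap(i, j-1)) >= T && abs(lap(i, j+1)) >= T;
        ud = sign(lap(i-1, j)) ~= sign(lap(i+1, j)) && ...
             abs(lap(i-1, j)) >= T && abs(lap(i+1, j)) >= T;
        edgeMap(i, j) = lr || ud;
    end
end
```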
Scale Invariant Feature Transform (SIFT)
SIFT is useful for finding distinctive patches (keypoints) in images and transforming the keypoints (locations) into feature vectors for recognition tasks.
SIFT consists of four steps:
- Scale-space extrema detection
- Keypoint localization
- Orientation assignment
- Keypoint descriptor
SIFT features are
- invariant to image rotation and scale
- partially invariant to changes in illumination and 3D camera viewpoint
The method is a cascade (one step followed by the other step) filtering approach, in which more computationally expensive operations are applied only at locations that pass an initial test.
[1] Scale-space extrema detection
Candidate locations are identified by searching for stable features across all scales in the scale space. The scale space of an image is defined as \[ L(x,y,\sigma)=G(x,y,\sigma)\ast I(x,y)\\ G(x,y,\sigma)=\frac{1}{2\pi\sigma^2}\exp(-\frac{x^2+y^2}{2\sigma^2}) \] where \(\ast\) is the convolution operator (filtering operation), \(G(x, y, \sigma)\) is the variable-scale Gaussian and \(I(x, y)\) is the input image.
The difference between two nearby scales separated by a constant multiplicative factor \(k\) is \[ \begin{align} D(x,y,\sigma)&=L(x,y,k\sigma)-L(x,y,\sigma)\\ &=\left(G(x,y,k\sigma)-G(x,y,\sigma)\right)\ast I(x,y) \end{align} \] The DoG function provides a close and efficient approximation to the scale-normalized Laplacian of Gaussian (LoG).
Local extrema (maxima and minima) of \(D(x, y, \sigma)\) are detected by comparing each pixel to its 26 neighbors: 9 in the scale above, 9 in the scale below and 8 in the current image. A pixel is selected only if it is larger than all 26 of these neighbors or smaller than all of them.
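The following is a rough Matlab illustration of one set of DoG images and the 26-neighbor test (sigma, k, the filter sizes and the single-octave treatment are illustrative simplifications, not the full SIFT pyramid; fspecial and imfilter are from the Image Processing Toolbox):

```matlab
I     = im2double(imread('xxx.png'));               % placeholder file name
sigma = 1.6;  k = sqrt(2);                          % example scale parameters
blur  = @(s) imfilter(I, fspecial('gaussian', 2*ceil(3*s)+1, s), 'replicate');
L = cat(3, blur(sigma), blur(k*sigma), blur(k^2*sigma), blur(k^3*sigma));
D = L(:, :, 2:4) - L(:, :, 1:3);                    % three adjacent DoG images
[rows, cols, ~] = size(D);
extrema = false(rows, cols);
for i = 2:rows-1
    for j = 2:cols-1
        cube = D(i-1:i+1, j-1:j+1, :);              % 3x3x3 neighbourhood (27 values)
        v    = D(i, j, 2);                          % centre pixel of the middle DoG
        % keep it if it is a (non-strict) maximum or minimum of the cube
        extrema(i, j) = (v >= max(cube(:))) || (v <= min(cube(:)));
    end
end
```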
[2] Keypoint localization
Once a keypoint candidate has been found by comparing a pixel to its neighbors, the next step is to decide whether to keep it, based on its local contrast and its localization along an edge. Keypoints with low contrast are rejected.
An extremum is discarded if \(|D(x)|<0.03\) (minimum contrast, assuming pixel values in the range \([0,1]\)), where \(x\) is the interpolated image position at which \(D\) attains its extremum. Keypoints are also rejected if they are poorly localized along an edge, which is determined from the ratio of principal curvatures.
[3] Orientation assignment
Each keypoint is assigned a dominant orientation, computed from the Gaussian-smoothed image \(L\) at the keypoint's scale. The keypoint descriptor can then be represented relative to this orientation and therefore achieves invariance to image rotation.
The gradient magnitude and orientation are \[ m(x,y)=\sqrt{(L(x+1,y)-L(x-1,y))^2+(L(x,y+1)-L(x,y-1))^2}\\ \theta(x,y)=\tan^{-1}\left(\frac{L(x,y+1)-L(x,y-1)}{L(x+1,y)-L(x-1,y)}\right) \] An orientation histogram is formed from the gradient orientations within a region around the keypoint. The histogram has 36 bins covering the 360-degree range of orientations, 10 degrees per bin. Each sample added to the histogram is weighted by its gradient magnitude and by a Gaussian-weighted circular window (with \(\sigma\) equal to 1.5 times the scale of the keypoint, which determines the effective window size). The peak of the orientation histogram corresponds to the dominant direction of the local gradients.
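A simplified Matlab sketch of the 36-bin histogram for a single keypoint follows; here m and theta are the gradient magnitude and orientation images defined above, (kx, ky) and s are the keypoint position and scale, and the window is assumed to lie inside the image (these names are illustrative, not part of any library):

```matlab
radius = round(3 * 1.5 * s);                       % window radius around the keypoint
sigmaW = 1.5 * s;                                  % Gaussian weighting sigma
hist36 = zeros(1, 36);
for dy = -radius:radius
    for dx = -radius:radius
        y   = ky + dy;  x = kx + dx;               % assumed to stay inside the image
        w   = exp(-(dx^2 + dy^2) / (2 * sigmaW^2));% Gaussian weight
        ang = mod(theta(y, x), 2*pi);              % orientation in [0, 2*pi)
        bin = min(36, floor(ang / (2*pi/36)) + 1); % 10-degree bins
        hist36(bin) = hist36(bin) + w * m(y, x);   % magnitude-weighted vote
    end
end
[~, peakBin] = max(hist36);                        % dominant orientation bin
```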
[4] Descriptor for local image region
For each detected keypoint, a local image descriptor is computed. It is partially invariant to change in illumination and 3D viewpoint.
In order to achieve orientation invariance, the coordinates and the gradient orientations are rotated relative to the keypoint orientation. A Gaussian weighting function with σ equal to one half the width of the descriptor window is used to assign a weight to the magnitude of each sample point.
Each subregion generates an orientation histogram with 8 orientation bins. Therefore, for each keypoint, if a \(2\times2\) array of subregions is used, the feature vector has \(2\times2\times8 = 32\) elements; if a \(4\times4\) array is used, the feature vector has \(4\times4\times8 = 128\) elements. The optimal setting is \(4\times4\) subregions with 8 orientation bins (as in the picture above).
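As a simplified sketch of how the 128-element vector is assembled (rotation to the keypoint orientation, Gaussian weighting and interpolation are omitted; m, theta, kx and ky are assumed as in the previous sketch):

```matlab
% Simplified 4x4x8 descriptor: a 16x16 patch around (kx, ky) is split into
% 4x4 subregions, each voting into an 8-bin orientation histogram.
patchM = m(ky-8:ky+7, kx-8:kx+7);                  % gradient magnitudes in the patch
patchT = mod(theta(ky-8:ky+7, kx-8:kx+7), 2*pi);   % orientations in [0, 2*pi)
desc = zeros(1, 128);
for by = 1:4
    for bx = 1:4
        subM = patchM((by-1)*4+1:by*4, (bx-1)*4+1:bx*4);
        subT = patchT((by-1)*4+1:by*4, (bx-1)*4+1:bx*4);
        bins = min(8, floor(subT / (2*pi/8)) + 1); % 8 orientation bins
        h = zeros(1, 8);
        for n = 1:16
            h(bins(n)) = h(bins(n)) + subM(n);     % magnitude-weighted vote
        end
        desc(((by-1)*4 + bx - 1)*8 + (1:8)) = h;   % place the 8 bins in the vector
    end
end
desc = desc / max(norm(desc), eps);                % normalise to unit length
```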
Feature matching
Matlab Implementation
```matlab
% num = match(image1, image2)
```
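The fragment above refers to a match(image1, image2) helper (as in Lowe's demo code, which is not reproduced here). A hedged sketch of the underlying idea, nearest-neighbour matching of descriptors with a distance-ratio test, assuming two precomputed descriptor matrices d1 (n1 x 128) and d2 (n2 x 128):

```matlab
ratioThresh = 0.8;                                 % common ratio-test threshold
matches = zeros(size(d1, 1), 1);                   % index into d2, 0 = no match
for i = 1:size(d1, 1)
    dists = sqrt(sum(bsxfun(@minus, d2, d1(i, :)).^2, 2));  % distances to all of d2
    [sorted, idx] = sort(dists);
    if sorted(1) < ratioThresh * sorted(2)         % best match clearly better than 2nd
        matches(i) = idx(1);
    end
end
num = nnz(matches);                                % number of accepted matches
```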
Harris Corner Detector
Basic Idea
We should be able to easily recognize the point by looking at intensity values within a small window. Shifting the window in any direction should yield a large change in appearance.
Harris corner detector gives a mathematical approach for determining which case holds.
Change of intensity for the shift \([u, v]\): \[ E(u,v)=\sum_{x,y}w(x,y)\left[I(x+u,y+v)-I(x,y)\right]^2 \] where \(w(x,y)\) is the window function.
For nearly constant patches, \(E(u,v)\) will be near 0; for very distinctive patches, it will be larger. Hence we want patches where \(E(u,v)\) is LARGE.
For small shifts \([u, v]\), we have a bilinear approximation: \[ E(u,v)\approx [u,v]M[u,v]^T \] where \(M\) is a 2x2 matrix computed from image derivatives: \[ M=\sum_{x,y}w(x,y)\begin{bmatrix} I_x^2 & I_xI_y \\ I_xI_y & I_y^2 \end{bmatrix} \] Treat gradient vectors as a set of \((dx,dy)\) points with a center of mass defined as being at \((0,0)\). Fit an ellipse to that set of points via scatter matrix.
Measure of corner response: \[ R=Det(M)-k(Trace(M))^2 \] where \[ Det(M)=\lambda_1\lambda_2\\ Trace(M)=\lambda_1+\lambda_2 \] According to \(R\), we can detect corners:
\(R\) depends only on the eigenvalues of \(M\). As can be seen from the figure above:
- \(R\) is large for a corner.
- \(R\) is negative with large magnitude for an edge
- \(|R|\) is small for a flat region
Harris Corner Detection Algorithm
1. Compute the \(x\) and \(y\) derivatives of the image: \[ I_x=G_\sigma^x\ast I\\ I_y=G_\sigma^y\ast I \]
2. Compute the products of derivatives at every pixel: \[ I_x^2=I_x\cdot I_x\\ I_y^2=I_y\cdot I_y\\ I_{xy}=I_x\cdot I_y \]
3. Compute the sums of the products of derivatives at each pixel: \[ S_x^2=G_\sigma'\ast I_x^2\\ S_y^2=G_\sigma'\ast I_y^2\\ S_{xy}=G_\sigma'\ast I_{xy} \]
4. Define at each pixel \((x,y)\) the matrix: \[ M(x,y)=\begin{bmatrix} S_x^2(x,y) & S_{xy}(x,y) \\ S_{xy}(x,y) & S_y^2(x,y) \end{bmatrix} \]
5. Compute the response of the detector at each pixel: \[ R=Det(M)-k(Trace(M))^2 \]
6. Threshold the value of \(R\) and apply non-maximum suppression.
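A compact Matlab sketch of these steps (simple derivative masks stand in for the derivative-of-Gaussian filters; the window size, \(k=0.04\) (typical values are 0.04 to 0.06) and the threshold are example choices; imfilter, fspecial and ordfilt2 are from the Image Processing Toolbox):

```matlab
I  = im2double(imread('xxx.png'));                 % placeholder file name
dx = [-1 0 1; -1 0 1; -1 0 1];                     % simple x-derivative mask
Ix = imfilter(I, dx,  'replicate');                % Step 1: x derivative
Iy = imfilter(I, dx', 'replicate');                %         y derivative
g  = fspecial('gaussian', 7, 1.5);                 % Gaussian window w(x,y)
Sx2 = imfilter(Ix.^2,  g, 'replicate');            % Steps 2-3: smoothed products
Sy2 = imfilter(Iy.^2,  g, 'replicate');
Sxy = imfilter(Ix.*Iy, g, 'replicate');
k = 0.04;                                          % empirical constant
R = (Sx2.*Sy2 - Sxy.^2) - k * (Sx2 + Sy2).^2;      % Steps 4-5: Det(M) - k*Trace(M)^2
T = 0.01 * max(R(:));                              % example threshold
corners = (R > T) & (R == ordfilt2(R, 9, ones(3)));% Step 6: threshold + 3x3 nonmax
```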